Bi-order multimodal integration of single-cell data

Integration of single-cell multiomics profiles generated by different single-cell technologies from the same biological sample is still challenging. Previous approaches based on shared features have only provided approximate solutions. Here, we present a novel mathematical solution named bi-order canonical correlation analysis (bi-CCA), which extends the widely used CCA approach to iteratively align the rows and the columns between data matrices. Bi-CCA is generally applicable to combinations of any two single-cell modalities. Validations using co-assayed ground truth data and application to a CAR-NK study and a fetal muscle atlas demonstrate its capability in generating accurate multimodal co-embeddings and discovering cellular identity. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-022-02679-x.


Supplementary Note 1 Previous studies on multi-omics integration 28
A recent study [1] evaluated 14 single-cell batch-effect correction/integration methods, and 29 recommended Harmony [2], LIGER [3], and Seurat3.0 [4]. In addition, we included four manifold-30 alignment based methods Panoma [5], SCOT [6], UnionCom [7] and MMD-MA [8] for 31 comparison. 32 Harmony 33 Harmony [2] uses an iterative clustering approach to align cells from different batches. The 34 algorithm first combines the batches and projects the data into a dimensionally reduced space using 35 PCA. It then uses an iterative procedure to remove the multi-dataset specific effects. In our analysis, 36 we ran Harmony within the Seurat3.0 based on the guide 37 dimensional space where each gene is characterized by two sets of factors. The first set contains 49 3 dataset-specific factors, and the second contains shared factors. The shared factor space is then 50 used to identify similar cell types across datasets by first constructing a shared factor neighborhood 51 graph to connect cells with similar factor loading patterns. Joint clusters are then identified using 52 the Louvain community detection algorithm. Thereafter, the factor loading quantiles are 53 normalized using the largest data batch as a reference to achieve batch-correction. In our work, we 54 followed the LIGER documentation 55 (http://htmlpreview.github.io/?https://github.com/MacoskoLab/liger/blob/master/vignettes/Integr 56 ating_scRNA_and_scATAC_data.html). For preprocessing, we used the LIGER preprocessing 57 functions, where we first selected genes with high variances. We then performed iNMF-based 58 factorization using an alternating least squares algorithm, followed by data alignment using joint 59 clustering and quantile alignment. 60

MMD-MA 61
MMD-MA performs multiomic data integration by optimizing the objective function with three 62 components: 1) a maximum mean discrepancy term to make the differently measured points to 63 have similar distributions in the latent space based on the kernel Gram matrices; 2) a distortion 64 term to preserve the structure of the data between the input space and the latent space; and 3) a 65 penalty term to ensure that distortion between the data in the original space and the data mapped 66 to the low-dimensional space as small as possible. The transformation to the latent space is based 67 on Gaussian kernel, and thus solely retains cell-cell distance information. 68 UnionCom 69 UnionCom first uses the cell-cell distance matrix of each dataset to represent its manifold. It then 70 aligns the cells across single-cell multi-omics datasets by matching the distance matrices by an 71 extended version of the unsupervised manifold alignment method GUMA. Finally, it projects the 72 4 distinct unmatched features across single-cell multi-omics datasets by matching the distance 73 matrices via a matrix optimization method. 74

SCOT 75
Like UnionCom, SCOT (single-cell alignment using optimal transport) aims to preserve local 76 geometry when aligning single-cell data. The algorithm achieves this by constructing a k-nearest 77 neighbor graph for each dataset. SCOT uses Gromov-Wasserstein optimal transport to find a 78 probabilistic coupling between the samples of each dataset. Finally. It uses the coupling matrix to 79 project one single-cell dataset onto another through barycentric projection, thus aligning them. 80

81
Pamona 82 Pamona formulates the single cell multi-omics datasets as the partial manifold alignment problem 83 and solves it under the partial Gromov-Wasserstein optimal transport framework. As the "partial" 84 in its name suggested, it is more flexible in handling sample-specific cells than SCOT. The 85 integration includes three steps: 1) constructs a weighted k-NN graph of cells of each dataset and 86 computes the geodesic distance matrix of cells within each dataset; 2) calculates probabilistic 87 coupling matrices of cells based on the partial Gromov-Wasserstein optimal transport; and 3) 88 aligns cellular modalities with distinct unmatched features in a common low-dimensional space. 89

Supplementary Note 2 Simulation Study 90
Existing integration methods such as Seurat, LIGER, and Harmony rely on pre-aligning features 91 across modalities, i.e., compressing cell-peak matrices obtained from scATAC-seq onto cell-gene We systematically simulated a series of datasets to assess how different combinations of proximal 103 and distal regulatory elements affect the integration. In brief, 1,500 scATAC-seq peaks for 2,000 104 cells were simulated using Splatter [10] (Additional file 1: Fig. S2a). The cells were uniformly 105 drawn into 3 types, resulting in = 668, 701, and 631, respectively. Then, 500 gene expressions 106 were calculated as the weighted mean of both proximal and distal peaks. Each gene is regulated 107 by three proximal peaks and two distal peaks. We varied the relative strength of distal peaks, , 108 from 0.2 to 0.9, and the strength of proximal peaks is then (1 − ). The gene activity matrix, 109 initialized from peaks in gene bodies, was an unweighted sum among the proximal peaks 110 (Additional file 1: Fig. S2b). Given the simulated scATAC-seq and scRNA-seq data, we used 111 various methods to integrate the two datasets. A good approach should be able to generate co-112 6 embeddings in which the cells from the two datasets are mixed according to their cell type 113 identities. As shown in Additional file 1: Fig. S2c, bindSC successfully integrated the datasets 114 under all the settings. Seurat maintained the overall population structure but was not able to mix 115 the data well. Its performance further deteriorated as increases. Similarly, Harmony was able to 116 maintain the overall population structure but mis-assigned cells to incorrect clusters (mixing colors) 117 as increases. Most of the other methods failed to produce co-embeddings of the expected 118 population structure. For example, UnionCom got confused when matching cell types of similar 119 abundance and swapped cell types (blue and red in Additional file 1: Fig. S2c, = 0.7). 120

121
We also found that the estimated matrix Z achieved a high correlation with the ground truth matrix 122 within the first 5 iterations (Additional file 1: Fig. S2d). Notably, the correlation was initially 123 worse for larger (i.e., distal regulation dominant) but quickly improved as Z got updated. 124 Although the real regulatory relations between ATAC peaks and gene expressions are considerably 125 more complex and dynamic, this simple simulation experiment proves that bindSC has a clear edge 126 over other methods in integrating modalities and inferring underlying regulatory relations across 127 a range of conditions. 128

129
To further test the robustness of bindSC w.r.t. variable data dimensions, we varied the number of 130 ATAC peaks from 1,000 to 10,000 while keeping the number of genes at 500. This simulated a 131 range of two to 20-fold imbalance between the numbers of features in the two modalities. The 132 results indicated the robustness of bindSC over the range of the parameters (Additional file1: Fig.  133 S2e). 134